aws: support for bring-your-own hosted zone #4772

Merged
openshift-merge-robot merged 4 commits into openshift:master from staebler:aws_byo_private_zone
Apr 11, 2021

Conversation

@staebler
Contributor

Add the .aws.hostedZone field to the install config to support the user supplying an existing hosted zone to serve as the cluster's internal private zone. This can only be used when the user is also supplying their own VPC. The hosted zone must already be associated with the user-provided VPC.
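
For illustration only, a minimal sketch of what the new field could look like on the AWS platform type. The field name comes from this PR; the surrounding struct and comments are assumptions, not the actual diff:

package aws

// Platform stores the AWS-specific parts of the install config
// (sketch; only the new field is shown).
type Platform struct {
	// ... existing fields such as Region and Subnets ...

	// HostedZone is the ID of an existing private hosted zone to use
	// for the cluster. It may only be set when the user also supplies
	// an existing VPC, and the zone must already be associated with
	// that VPC.
	// +optional
	HostedZone string `json:"hostedZone,omitempty"`
}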

Add validation in the "Platform Provisioning Check" asset for the user-provided internal hosted zone. The validation checks that
(1) the hosted zone is associated with the user-provided VPC and that (2) the hosted zone does not contain any record sets for subdomains of the cluster's domain.

The latter of these checks is meant to provide a modicum of protection against the user accidentally installing a second cluster over one that they have already installed. When it comes time to destroy the second, failed installation, the destroyer will not be able to tell that the record sets in the hosted zone are actually being used by a different cluster.

When the user provides the private hosted zone to use for the cluster, the destroyer still needs to delete the record sets for the cluster when the cluster is destroyed. There is no way to tag record sets, so we must rely on tagging the hosted zone as shared. When the destroyer encounters a hosted zone tagged as shared by the cluster, it will delete all record sets in that hosted zone that are strict subdomains of the cluster's domain. The cluster domain is added to the AWS cluster metadata so that it is available to the destroyer.
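
A rough sketch of the "strict subdomain" rule described above; the helper name and the trailing-dot normalization are assumptions, not the PR's actual code:

package main

import (
	"fmt"
	"strings"
)

// isStrictSubdomain reports whether a Route 53 record name is a strict
// subdomain of the cluster domain. Route 53 returns record names fully
// qualified with a trailing dot, so normalize before comparing. A name
// equal to the cluster domain itself (the zone's own NS/SOA records in
// the shared-zone case) is deliberately not a match.
func isStrictSubdomain(recordName, clusterDomain string) bool {
	name := strings.TrimSuffix(recordName, ".")
	domain := strings.TrimSuffix(clusterDomain, ".")
	return name != domain && strings.HasSuffix(name, "."+domain)
}

func main() {
	fmt.Println(isStrictSubdomain("api-int.mycluster.example.com.", "mycluster.example.com")) // true
	fmt.Println(isStrictSubdomain("mycluster.example.com.", "mycluster.example.com"))         // false
	fmt.Println(isStrictSubdomain("other.example.com.", "mycluster.example.com"))             // false
}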

https://issues.redhat.com/browse/CORS-1666

@staebler
Contributor Author

/test e2e-aws-shared-vpc

@openshift-ci-robot added the needs-rebase label Mar 24, 2021
@staebler force-pushed the aws_byo_private_zone branch from 6676aa8 to 5659a4f on March 29, 2021 14:14
@openshift-ci-robot removed the needs-rebase label Mar 29, 2021
@staebler force-pushed the aws_byo_private_zone branch from 5659a4f to 34aef82 on April 5, 2021 15:16
@staebler
Contributor Author

staebler commented Apr 5, 2021

5659a4fc1...34aef8216

@staebler force-pushed the aws_byo_private_zone branch from 34aef82 to fcf44b7 on April 7, 2021 18:13
@patrickdillon left a comment
Contributor

I have reviewed the first three commits and will take a closer look at the destroy code soon.

Contributor

nit: if -> is

Contributor

I don't get why we continue here. It seems like it is ok to have a record equal to the cluster domain. Is that the case?

Contributor Author

Yes, every hosted zone has two records, an NS record and an SOA record, with the same domain as the hosted zone. I will add a comment about this.

Contributor

Is `invalid` the right type of error here? Perhaps `InternalError` would be more suitable?

Contributor Author

Yep, that makes sense.
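
For reference, a hedged sketch of the distinction, assuming the k8s.io/apimachinery/pkg/util/validation/field helpers the installer uses for validation; route53Client, GetHostedZone, and AssociatedWithVPC here are hypothetical stand-ins, not real API names:

// validateHostedZone illustrates the two error flavors (sketch only).
func validateHostedZone(fldPath *field.Path, zoneID string, client route53Client) field.ErrorList {
	allErrs := field.ErrorList{}
	zone, err := client.GetHostedZone(zoneID)
	if err != nil {
		// The lookup itself failed: our machinery, not the user's input.
		return append(allErrs, field.InternalError(fldPath, err))
	}
	if !zone.AssociatedWithVPC {
		// The user's input is wrong: Invalid is the right flavor here.
		allErrs = append(allErrs, field.Invalid(fldPath, zoneID, "hosted zone is not associated with the VPC"))
	}
	return allErrs
}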

Comment thread pkg/destroy/aws/aws.go Outdated
Contributor

nit: comment does not match function name

@patrickdillon left a comment
Contributor

This looks mostly sane to me, with a few nits and a couple of questions, especially regarding the trailing dot. Overall it looks good. I may take another, closer look, as I don't have a lot of background in the AWS destroy code.

Comment thread pkg/destroy/aws/aws.go Outdated
Contributor

nit: I'm not sure + in the verb %#+v is doing anything. From fmt:

%v	the value in a default format
	when printing structs, the plus flag (%+v) adds field names
%#v	a Go-syntax representation of the value

https://play.golang.org/p/-LjA46_hvi3
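
A quick runnable illustration of the same point (hypothetical struct; the comments show what each verb prints):

package main

import "fmt"

type recordSet struct {
	Name string
	Type string
}

func main() {
	rs := recordSet{Name: "api-int.example.com.", Type: "A"}
	fmt.Printf("%v\n", rs)   // {api-int.example.com. A}
	fmt.Printf("%+v\n", rs)  // {Name:api-int.example.com. Type:A}
	fmt.Printf("%#v\n", rs)  // main.recordSet{Name:"api-int.example.com.", Type:"A"}
	fmt.Printf("%#+v\n", rs) // same as %#v; the + flag adds nothing once # is set
}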

Contributor Author

I agree. This was copied from

lastError = errors.Wrapf(err, "deleting record set %#+v from zone %s", recordSet, id)

I will remove it here.

Comment thread pkg/destroy/aws/shared.go Outdated
Comment on lines 23 to 37
Contributor

nit (or maybe more of a rant): I realize (now) this code is not original to this PR, but this seems like a convoluted way of doing:

key := "kubernetes.io/cluster/" + clusterID
o.removeSharedTag(ctx, session, tagClients, key, tracker)

Perhaps the existing code is more future-proof. Let's not fix what's not broken; just ranting, or maybe I'm missing something.

Contributor Author

Yes, this is a bit of code that takes me too long to understand every time I look at it. I don't know that we want to rely on the clusterID being set correctly, although I can't think of a good reason why it wouldn't be.

Maybe it would be sufficient to have a function that gets the cluster tag keys from the filters.

func clusterOwnedKeys(filters []Filter) []string {
	var keys []string
	for _, filter := range filters {
		for key, value := range filter {
			if !strings.HasPrefix(key, "kubernetes.io/cluster/") {
				continue
			}
			if value != "owned" {
				continue
			}
			keys = append(keys, key)
		}
	}
	return keys
}

And then the removeSharedTags function could be the following.

func removeSharedTags(ctx context.Context, tagClients []*resourcegroupstaggingapi.ResourceGroupsTaggingAPI, filters []Filter, logger logrus.FieldLogger) error {
	for _, key := range clusterOwnedKeys(filters) {
		if err := removeSharedTag(ctx, tagClients, key, logger); err != nil {
			return err
		}
	}
	return nil
}
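
A hypothetical call site in the destroyer's cleanup path, assuming the options struct already carries the filters and logger (the field names here are illustrative, not the PR's actual code):

if err := removeSharedTags(ctx, tagClients, o.Filters, o.Logger); err != nil {
	return errors.Wrap(err, "failed to remove shared tags")
}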

Comment thread pkg/destroy/aws/shared.go Outdated
Contributor

Should this be capitalized: nothing -> Nothing?
Or is that not necessary because it is WithField?

Contributor Author

It should be capitalized.

Comment thread pkg/destroy/aws/shared.go Outdated
Contributor

Same question about caps: no -> No

Comment thread pkg/destroy/aws/shared.go Outdated
Contributor

Is the name of the record set dotted, or is it the value of the record that is dotted? When I look in the GUI, it looks like the name isn't dotted but the value is.

Contributor Author

The name of the record set is dotted.
For example,

        {
            "Name": "api-int.ewolinetz3.devcluster.openshift.com.",
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "ZLMOA37VPKANP",
                "DNSName": "ewolinetz3-rpcfk-int-42decb5de56b9b0b.elb.us-east-2.amazonaws.com.",
                "EvaluateTargetHealth": false
            }
        },

Comment thread pkg/destroy/aws/shared.go Outdated
Contributor

same comment as above about %#+v verb

Comment thread pkg/destroy/aws/shared.go Outdated
Contributor

same comment as above about %#+v verb

Comment thread pkg/destroy/aws/shared.go Outdated
Contributor

Should we add in the error message for the public record that we skipped deleting the private record?

Contributor Author

No. The destroyer should try again later to delete the records.

staebler added 3 commits April 8, 2021 16:53
Add the `.aws.hostedZone` field to the install config to support
the user supplying an existing hosted zone for the internal
private hosted zone for the cluster. This can only be used when
the user is also supplying their own VPC. The hosted zone must
already be associated with the user-provided VPC.

https://issues.redhat.com/browse/CORS-1666
When the user provides an existing hosted zone to use for the
cluster's private zone, the dns.config.openshift.io resource
needs to be adjusted so that the operator can locate the
hosted zone. Since the hosted zone already exists, we can specify
the ID of the hosted zone rather than relying on tags.
Add validation in the "Platform Provisioning Check" asset for the
user-provided internal hosted zone. The validation checks that
(1) the hosted zone is associated with the user-provided VPC and that
(2) the hosted zone does not contain any record sets for subdomains
of the cluster's domain.

The latter of these checks is meant to provide a modicum of protection
against the user accidentally installing a second cluster over one
that they have already installed. When it comes time to destroy the
second, failed installation, the destroyer will not be able to tell
that the record sets in the hosted zone are actually being used by a
different cluster.
@staebler force-pushed the aws_byo_private_zone branch from fcf44b7 to 9bc4190 on April 8, 2021 21:10
@staebler
Contributor Author

staebler commented Apr 8, 2021

@patrickdillon I have addressed your feedback.

When the user provides the private hosted zone to use for the cluster,
the destroyer still needs to delete the record sets for the cluster
when the cluster is destroyed. There is no way to tag record sets, so
we must rely on tagging the hosted zone as shared. When the destroyer
encounters a hosted zone tagged as shared by the cluster, the destroyer
will delete all record sets in that hosted zone that are strict
subdomains of the cluster's domain. The cluster domain is added to the
AWS cluster metadata so that it is available to the destroyer.
@staebler force-pushed the aws_byo_private_zone branch from 9bc4190 to cf18b17 on April 8, 2021 21:21
@patrickdillon
Contributor

I'm having a hard time reasoning through this destroy code in the context of avoiding deleting non-cluster records. I may also be hitting some limits in my knowledge.

I need to revisit the code & will do that immediately, but we're in a bit of a time crunch, so I want to write this out to aid the process/discussion.

What about this scenario (I think this would pass pre-provision check, but let me know if something else makes this invalid):

We have a shared hosted zone: shared.devcluster.openshift.com

We install a cluster with the cluster domain shared.devcluster.openshift.com

We install a second cluster with the cluster domain foo.shared.devcluster.openshift.com

We delete the first cluster, with cluster domain shared.devcluster.openshift.com. Doesn't that delete the records for the second cluster?

If so, I am thinking we should lean toward deleting whitelisted records in this shared scenario (e.g. only delete api + cluster domain, etc.). I'm sure you've considered this, and I imagine the downside is that if the cluster creates non-whitelisted records they will be leaked, but that's not the case at the moment, right?

@patrickdillon
Contributor

/lgtm

@openshift-ci-robot added the lgtm label Apr 9, 2021
@staebler
Contributor Author

staebler commented Apr 9, 2021

For posterity, since this was discussed outside of this PR,

> I'm having a hard time reasoning through this destroy code in the context of avoiding deleting non-cluster records. I may also be hitting some limits in my knowledge.
>
> I need to revisit the code & will do that immediately, but we're in a bit of a time crunch, so I want to write this out to aid the process/discussion.
>
> What about this scenario (I think this would pass pre-provision check, but let me know if something else makes this invalid):
>
> We have a shared hosted zone: shared.devcluster.openshift.com
>
> We install a cluster with the cluster domain shared.devcluster.openshift.com
>
> We install a second cluster with the cluster domain foo.shared.devcluster.openshift.com
>
> We delete the first cluster, with cluster domain shared.devcluster.openshift.com. Doesn't that delete the records for the second cluster?

Yes, that will delete the records for the second cluster. The contention here is that it is a misconfiguration for the second cluster to use a cluster domain that is a subdomain of the first cluster's domain. The first cluster owns its entire cluster domain. We cannot check for this at pre-provision time because there is no way to (reliably) tell whether a cluster with a parent domain already exists in the hosted zone.

> If so, I am thinking we should lean toward deleting whitelisted records in this shared scenario (e.g. only delete api + cluster domain, etc.). I'm sure you've considered this, and I imagine the downside is that if the cluster creates non-whitelisted records they will be leaked, but that's not the case at the moment, right?

A whitelist is problematic because record sets can be created by in-cluster components with names that the installer does not know about. The installer only creates the api and api-int record sets. For example, the *.apps record set is created in-cluster.

@staebler
Contributor Author

staebler commented Apr 9, 2021

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label Apr 9, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-ci
Contributor

openshift-ci Bot commented Apr 11, 2021

@staebler: The following test failed, say /retest to rerun all failed tests:

Test name            Commit    Details    Rerun command
ci/prow/e2e-libvirt  cf18b17   link       /test e2e-libvirt

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@staebler
Contributor Author

/cherry-pick release-4.7

@openshift-cherrypick-robot

@staebler: #4772 failed to apply on top of branch "release-4.7":

Applying: aws: support for bring-your-own hosted zone
Applying: assets: use aws byo hosted zone in dns config
Applying: aws: add provision checks for byo hosted zone
Using index info to reconstruct a base tree...
M	pkg/asset/installconfig/aws/validation.go
M	pkg/asset/installconfig/platformprovisioncheck.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/asset/installconfig/platformprovisioncheck.go
CONFLICT (content): Merge conflict in pkg/asset/installconfig/platformprovisioncheck.go
Auto-merging pkg/asset/installconfig/aws/validation.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 aws: add provision checks for byo hosted zone
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm Indicates that a PR is ready to be merged.

6 participants